library(tidyverse)
library(psych)
library(lavaan)
library(lme4)
library(data.table)
library(readxl)
library(magrittr)
library(haven)
library(stargazer)
library(GGally)
library(car)
library(kableExtra)
library(yarrr)
library(lsmeans)
library(broom)
library(multcomp)
library(lsr)
library(apaTables)
library(MBESS)
Every once and a while R releases a new version of the software. This will not download automagically, so you should check the r website https://www.r-project.org/ from time to time, or follow them on twitter. You can see what version you are currently on by entering version into the console.
#install.packages()
#require()
#library()
Calling ?function allows you to look at the arguments for a function as well as their defaults. This is really helpful for when using new functions, knowing what abilities a function has, or for troubleshooting.
?read.csv
There are three main ways to read in csv files. First, you can use base R with the function read.csv. Alternatively you can use the read_csv function from the readr package. The final way is with the fread function from the data.table package.
# Base R
iris.1<- read.csv("./data/iris.csv",row.names=NULL)
# Readr
iris.2<- read_csv("./data/iris.csv")
# data.table
iris.3<- fread("./data/iris.csv")
For excel files, you can use the readxl package. Note: if you have multiple sheets you will need to specify which sheet you want to read in (i.e., read_excel("<filename>",sheet = "<sheetname>")
readxl_example("deaths.xlsx")
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/readxl/extdata/deaths.xlsx"
Deaths<-read_excel(readxl_example("deaths.xlsx"))
In this example we are downloading movie data from GitHub. Note that this isn’t web scraping, this is merely accessing a previously compiled dataset, which in this case is in csv format. Reading in data from a website is helpful if you are accessing archival datasets etc. We won’t get into APIs just yet but this is the first step in that direction.
movies <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/fandango/fandango_score_comparison.csv",header=TRUE)
head(movies)
Sometimes data is saved in a tabular format where values are separated by tabs or some other delimiter. We can use read.table which allows us to specify how the file is delimited.
insect<- read.table(file = 'http://www.ucd.ie/ecomodel/Resources/INSECT.TXT',
sep = '\t',
header = TRUE,
skip=3)
For SPSS files I prefer to use the haven package. Another alternative is the foreign package, but it is a little less consistent than I would prefer.
survey<- read_sav("~/Downloads/survey.sav")
Getting a sense of the structure and characteristics of a dataset is an important starting step. There are a few different ways we can examine a dataset.
head, or the last few rows with tail. It defaults to 10 rows, but you can adjust that by passing a number to the second argument.head(iris.1)
tail(iris.1)
head(iris.1,20)
We can use describe from the psych package.
psych::describe(iris.1)
We can print a descriptives table and get a headstart on putting together our final documentation with stargazer.
stargazer(iris.1,out="./output/iris.1.table.htm")
##
## % Table created by stargazer v.5.2.2 by Marek Hlavac, Harvard University. E-mail: hlavac at fas.harvard.edu
## % Date and time: Mon, Jan 27, 2020 - 00:27:00
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lccccccc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## Statistic & \multicolumn{1}{c}{N} & \multicolumn{1}{c}{Mean} & \multicolumn{1}{c}{St. Dev.} & \multicolumn{1}{c}{Min} & \multicolumn{1}{c}{Pctl(25)} & \multicolumn{1}{c}{Pctl(75)} & \multicolumn{1}{c}{Max} \\
## \hline \\[-1.8ex]
## X & 150 & 75.500 & 43.445 & 1 & 38.2 & 112.8 & 150 \\
## Sepal.Length & 150 & 5.843 & 0.828 & 4.300 & 5.100 & 6.400 & 7.900 \\
## Sepal.Width & 150 & 3.057 & 0.436 & 2.000 & 2.800 & 3.300 & 4.400 \\
## Petal.Length & 150 & 3.758 & 1.765 & 1.000 & 1.600 & 5.100 & 6.900 \\
## Petal.Width & 150 & 1.199 & 0.762 & 0.100 & 0.300 & 1.800 & 2.500 \\
## \hline \\[-1.8ex]
## \end{tabular}
## \end{table}
We can use summary from base r.
summary(iris.1)
## X Sepal.Length Sepal.Width Petal.Length
## Min. : 1.00 Min. :4.300 Min. :2.000 Min. :1.000
## 1st Qu.: 38.25 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
## Median : 75.50 Median :5.800 Median :3.000 Median :4.350
## Mean : 75.50 Mean :5.843 Mean :3.057 Mean :3.758
## 3rd Qu.:112.75 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
## Max. :150.00 Max. :7.900 Max. :4.400 Max. :6.900
## Petal.Width Species
## Min. :0.100 setosa :50
## 1st Qu.:0.300 versicolor:50
## Median :1.300 virginica :50
## Mean :1.199
## 3rd Qu.:1.800
## Max. :2.500
The Tidyverse is a collection of packages for reading, manipulating, analyzing, visualizing, and reporting in R. It has gained widespread popularity and rivals base R. Many people either strictly adhere to tidyverse packages, or utilize base R. For this reason, you may see people do the same thing in different ways, and they may recommend different solutions based on their preferred packages. Ultimately this points out one of the advantages of R which is that there are no right answers.
I think that tidyverse is untuitive, consistent across packages, flexible, and in many instances better performing (i.e., speed) than base R. Even if you choose not to use Tidyverse packages I think some of the functions are helpful to know.
Lets start with the basic pipe operator %>%. The pipe operator allows you to run multiple functions on a dataset in succession while cutting down on the number of lines of code and cleaning up your codes appearance. I like to think of the pipe as meaning “and then”.
Here we are calling the iris.1 data set and then grouping that dataset by Species, and then summarising those groups by calculating the mean. Because I do not assign this to a variable the output prints to the console without changing the data. One thing you will notice is that the pipe passes the data at the beginning to the next argument, so I am not specifying a dataset for the two functions, because it knows i’m referring to iris.1. Also remember that the functions are applied consecutively. So its like the results of each row are stored in a temporary dataset and you performing the next set of functions on that new dataset (not the original one).
iris.1 %>%
group_by(Species) %>%
summarise(mean = mean(Petal.Length))
With the %<>% assignment pipe, we can save the results to an object. This is equivalent to iris.1<- iris.1. Here
iris.1 %<>%
group_by(Species) %>%
summarise(mean = mean(Petal.Length))
###Dplyr
Dplyr is a package containing functions that allow us to manipulate data. Most if not all of the Dplyr verbs have a selection of scoped variants including _if, _at, and _all. Each function has some of its own unique scoped variants as well.
Filter allows us to.. well.. filter our data based on a formula that we define. You can use one criterion, you can specify that it filters on one criterion or another filter(Sepal.Length >= 5 | Sepal.Width < 3), or you can apply two filter conditions filter(Sepal.Length >= 5 & Sepal.Width < 3).
iris.2 %>%
filter(Sepal.Length >= 5)
The select function allows us to remove columns, or pick out columns for a new dataset.
You can either pass a list of column names (no quotes necessary), you can use : and specify the first and last column of a range, or you can negative subset to remove columns.
Select comes with some unique modifiers which are useful for dealing with named columns. This includes Starts_with,Ends_with, and contains.
demographics<-survey %>%
dplyr::select(id:educ)
demographics<- survey %>%
dplyr::select(-source)
iris %>%
dplyr::select(starts_with("Sepal"))
Select is also valuable for reordering columns
iris %>%
dplyr::select(Sepal.Length,everything())
The mutate function allows us to compute new variables. By default mutate iterates over columns. You can make it iterate over rows with rowwise(), however this is no longer supported by people such as Hadley Wickham, who suggest using purrr instead.
survey %>%
mutate(year = 2015) %>%
dplyr::select(id,year)
Transmute is similar to mutate but it only keeps the new variables by default.
survey %>%
rowwise() %>%
transmute(op = mean(c(op1,op2,op3,op4,op5,op6,na.rm = TRUE)))
iris %>%
mutate(Max.Len= purrr::pmap_dbl(list(Sepal.Length, Petal.Length), max))
reliability.data<- read_tsv("./data/data.csv")
Extraversion<- dplyr::select(reliability.data,starts_with("E"),-engnat)
Neuroticism<- dplyr::select(reliability.data,starts_with("N"))
Agreeableness<- dplyr::select(reliability.data,starts_with("A"),-age)
Openness<- dplyr::select(reliability.data,starts_with("O"))
Conscientiousness<- dplyr::select(reliability.data,starts_with("C"),-country)
measures<-list(Extraversion,Neuroticism,Agreeableness,Openness,Conscientiousness)
map(measures,psych::alpha)
## Some items ( E2 E4 E6 E8 E10 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' optionSome items ( N2 N4 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' optionSome items ( A1 A3 A5 A7 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' optionSome items ( O2 O4 O6 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' optionSome items ( C2 C4 C6 C8 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## [[1]]
##
## Reliability analysis
## Call: .f(x = .x[[i]])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## -0.39 -0.39 0.26 -0.029 -0.28 0.016 3.1 0.35 -0.34
##
## lower alpha upper 95% confidence boundaries
## -0.42 -0.39 -0.36
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## E1 -0.49 -0.44 0.23 -0.035 -0.31 0.017 0.22 -0.34
## E2 -0.28 -0.32 0.28 -0.028 -0.24 0.014 0.21 -0.34
## E3 -0.48 -0.44 0.22 -0.035 -0.31 0.017 0.21 -0.34
## E4 -0.24 -0.26 0.31 -0.023 -0.21 0.014 0.21 -0.34
## E5 -0.36 -0.33 0.26 -0.028 -0.25 0.016 0.20 -0.34
## E6 -0.35 -0.38 0.26 -0.031 -0.27 0.015 0.23 -0.34
## E7 -0.46 -0.44 0.20 -0.035 -0.31 0.017 0.20 -0.34
## E8 -0.27 -0.30 0.31 -0.026 -0.23 0.014 0.23 -0.36
## E9 -0.34 -0.32 0.28 -0.028 -0.24 0.016 0.23 -0.34
## E10 -0.19 -0.22 0.34 -0.021 -0.18 0.013 0.22 -0.34
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## E1 19719 0.38 0.36 0.2892 0.030 2.6 1.2
## E2 19719 0.22 0.25 0.1153 -0.152 2.8 1.3
## E3 19719 0.37 0.35 0.3026 0.024 3.4 1.2
## E4 19719 0.15 0.20 0.0047 -0.192 3.2 1.2
## E5 19719 0.29 0.26 0.1814 -0.079 3.4 1.3
## E6 19719 0.27 0.30 0.1578 -0.087 2.5 1.2
## E7 19719 0.40 0.36 0.3807 -0.008 2.9 1.4
## E8 19719 0.20 0.23 -0.0179 -0.163 3.4 1.3
## E9 19719 0.30 0.26 0.0656 -0.100 3.1 1.4
## E10 19719 0.13 0.16 -0.1046 -0.236 3.6 1.3
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## E1 0 0.24 0.23 0.28 0.18 0.07 0
## E2 0 0.21 0.24 0.24 0.18 0.13 0
## E3 0 0.08 0.17 0.24 0.28 0.23 0
## E4 0 0.10 0.21 0.28 0.25 0.16 0
## E5 0 0.09 0.16 0.21 0.28 0.25 0
## E6 0 0.26 0.32 0.19 0.15 0.08 0
## E7 0 0.23 0.21 0.18 0.19 0.18 0
## E8 0 0.09 0.19 0.23 0.26 0.24 0
## E9 0 0.17 0.20 0.19 0.23 0.21 0
## E10 0 0.08 0.16 0.19 0.25 0.33 0
##
## [[2]]
##
## Reliability analysis
## Call: .f(x = .x[[i]])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.71 0.69 0.79 0.19 2.3 0.0026 3.1 0.67 0.38
##
## lower alpha upper 95% confidence boundaries
## 0.71 0.71 0.72
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## N1 0.66 0.63 0.74 0.16 1.7 0.0031 0.142 0.36
## N2 0.80 0.80 0.84 0.30 3.9 0.0019 0.095 0.41
## N3 0.68 0.65 0.76 0.17 1.9 0.0030 0.150 0.39
## N4 0.79 0.78 0.83 0.28 3.5 0.0019 0.125 0.41
## N5 0.67 0.64 0.76 0.17 1.8 0.0030 0.159 0.39
## N6 0.64 0.61 0.73 0.15 1.6 0.0033 0.139 0.36
## N7 0.64 0.62 0.72 0.15 1.6 0.0033 0.142 0.37
## N8 0.63 0.61 0.71 0.15 1.6 0.0034 0.138 0.36
## N9 0.64 0.62 0.74 0.15 1.6 0.0033 0.145 0.36
## N10 0.67 0.65 0.75 0.17 1.8 0.0030 0.145 0.36
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## N1 19719 0.68 0.68 0.65 0.56 3.3 1.3
## N2 19719 -0.32 -0.30 -0.50 -0.46 3.2 1.2
## N3 19719 0.60 0.61 0.55 0.48 3.8 1.1
## N4 19719 -0.14 -0.13 -0.32 -0.31 2.8 1.2
## N5 19719 0.64 0.64 0.57 0.51 3.0 1.3
## N6 19719 0.77 0.77 0.76 0.67 3.0 1.3
## N7 19719 0.76 0.75 0.76 0.65 3.2 1.3
## N8 19719 0.78 0.77 0.79 0.68 2.8 1.4
## N9 19719 0.74 0.74 0.72 0.63 3.1 1.3
## N10 19719 0.64 0.63 0.59 0.50 2.8 1.3
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## N1 0 0.11 0.20 0.22 0.25 0.22 0
## N2 0 0.08 0.20 0.28 0.28 0.16 0
## N3 0 0.04 0.11 0.16 0.35 0.34 0
## N4 0 0.17 0.27 0.27 0.18 0.10 0
## N5 0 0.15 0.25 0.23 0.24 0.13 0
## N6 0 0.16 0.24 0.22 0.22 0.16 0
## N7 0 0.12 0.22 0.22 0.25 0.19 0
## N8 0 0.22 0.24 0.20 0.20 0.14 0
## N9 0 0.13 0.22 0.21 0.27 0.17 0
## N10 0 0.19 0.25 0.23 0.20 0.13 0
##
## [[3]]
##
## Reliability analysis
## Call: .f(x = .x[[i]])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## -0.024 0.053 0.38 0.0055 0.055 0.011 3.2 0.35 -0.15
##
## lower alpha upper 95% confidence boundaries
## -0.05 -0.02 0
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## A1 0.103 0.168 0.46 0.0219 0.202 0.0093 0.15 0.044
## A2 -0.073 -0.024 0.32 -0.0026 -0.024 0.0120 0.13 -0.205
## A3 0.123 0.199 0.48 0.0269 0.249 0.0093 0.15 0.064
## A4 -0.176 -0.144 0.22 -0.0142 -0.126 0.0131 0.11 -0.174
## A5 0.185 0.274 0.49 0.0403 0.378 0.0085 0.12 0.032
## A6 -0.203 -0.147 0.26 -0.0144 -0.128 0.0134 0.14 -0.174
## A7 0.187 0.278 0.48 0.0410 0.385 0.0085 0.12 0.032
## A8 -0.161 -0.119 0.28 -0.0119 -0.106 0.0129 0.13 -0.180
## A9 -0.247 -0.207 0.19 -0.0194 -0.171 0.0139 0.12 -0.174
## A10 -0.181 -0.124 0.29 -0.0124 -0.110 0.0132 0.14 -0.205
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## A1 19719 0.24410 0.142 -0.18 -0.145 2.3 1.4
## A2 19719 0.36191 0.415 0.36 0.062 3.9 1.1
## A3 19719 0.14985 0.086 -0.29 -0.192 2.2 1.2
## A4 19719 0.47230 0.543 0.65 0.197 4.0 1.0
## A5 19719 0.00741 -0.063 -0.39 -0.300 2.2 1.1
## A6 19719 0.50483 0.546 0.54 0.210 3.9 1.1
## A7 19719 -0.00089 -0.071 -0.37 -0.305 2.2 1.1
## A8 19719 0.45551 0.518 0.50 0.180 3.8 1.0
## A9 19719 0.54348 0.601 0.72 0.272 3.9 1.1
## A10 19719 0.47812 0.523 0.45 0.202 3.7 1.1
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## A1 0 0.39 0.25 0.13 0.14 0.10 0
## A2 0 0.03 0.08 0.17 0.34 0.37 0
## A3 0 0.40 0.26 0.17 0.13 0.05 0
## A4 0 0.03 0.07 0.14 0.36 0.40 0
## A5 0 0.34 0.35 0.16 0.10 0.05 0
## A6 0 0.04 0.09 0.18 0.31 0.38 0
## A7 0 0.34 0.35 0.17 0.10 0.05 0
## A8 0 0.03 0.09 0.22 0.39 0.26 0
## A9 0 0.04 0.08 0.14 0.37 0.37 0
## A10 0 0.03 0.09 0.29 0.34 0.25 0
##
## [[4]]
##
## Reliability analysis
## Call: .f(x = .x[[i]])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.27 0.3 0.53 0.041 0.42 0.0075 3.3 0.38 0.14
##
## lower alpha upper 95% confidence boundaries
## 0.26 0.27 0.29
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## O1 0.075 0.12 0.39 0.016 0.14 0.0099 0.090 -0.00089
## O2 0.429 0.44 0.59 0.082 0.80 0.0057 0.086 0.19340
## O3 0.223 0.24 0.48 0.034 0.32 0.0083 0.095 -0.00089
## O4 0.381 0.41 0.58 0.071 0.68 0.0061 0.094 0.19340
## O5 0.133 0.14 0.41 0.017 0.16 0.0092 0.090 -0.00089
## O6 0.438 0.47 0.61 0.089 0.87 0.0058 0.088 0.19340
## O7 0.183 0.20 0.47 0.027 0.25 0.0087 0.099 0.01249
## O8 0.075 0.14 0.40 0.017 0.16 0.0100 0.095 -0.00089
## O9 0.218 0.24 0.52 0.035 0.32 0.0083 0.111 0.00937
## O10 0.144 0.15 0.41 0.019 0.18 0.0091 0.083 -0.00089
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## O1 19719 0.627 0.613 0.64 0.40 3.7 1.12
## O2 19719 0.011 -0.031 -0.27 -0.27 2.1 1.14
## O3 19719 0.403 0.430 0.35 0.15 4.1 1.01
## O4 19719 0.114 0.077 -0.15 -0.17 2.1 1.11
## O5 19719 0.552 0.597 0.61 0.35 3.9 0.94
## O6 19719 -0.064 -0.097 -0.37 -0.32 1.8 1.07
## O7 19719 0.465 0.505 0.42 0.25 4.1 0.92
## O8 19719 0.629 0.595 0.60 0.36 3.2 1.26
## O9 19719 0.409 0.429 0.25 0.17 4.1 0.98
## O10 19719 0.535 0.577 0.60 0.31 4.0 0.98
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## O1 0 0.05 0.10 0.24 0.33 0.28 0
## O2 0 0.36 0.31 0.19 0.09 0.04 0
## O3 0 0.02 0.06 0.15 0.31 0.46 0
## O4 0 0.39 0.29 0.21 0.07 0.04 0
## O5 0 0.02 0.05 0.25 0.40 0.28 0
## O6 0 0.53 0.27 0.11 0.06 0.04 0
## O7 0 0.01 0.05 0.17 0.40 0.38 0
## O8 0 0.12 0.18 0.25 0.27 0.18 0
## O9 0 0.02 0.05 0.14 0.34 0.44 0
## O10 0 0.02 0.06 0.20 0.34 0.38 0
##
## [[5]]
##
## Reliability analysis
## Call: .f(x = .x[[i]])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.056 0.11 0.36 0.012 0.12 0.01 3.2 0.39 -0.16
##
## lower alpha upper 95% confidence boundaries
## 0.03 0.06 0.08
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## C1 -0.017 0.0077 0.28 0.00086 0.0078 0.0114 0.096 -0.163
## C2 0.105 0.1796 0.39 0.02374 0.2189 0.0097 0.100 0.016
## C3 -0.033 -0.0102 0.29 -0.00112 -0.0101 0.0116 0.111 -0.203
## C4 0.120 0.2053 0.41 0.02790 0.2584 0.0096 0.095 0.028
## C5 0.086 0.1008 0.34 0.01230 0.1121 0.0102 0.094 -0.163
## C6 0.140 0.2044 0.40 0.02775 0.2569 0.0092 0.093 0.028
## C7 -0.015 0.0129 0.30 0.00145 0.0130 0.0114 0.104 -0.163
## C8 0.162 0.2457 0.44 0.03493 0.3257 0.0093 0.102 0.028
## C9 -0.037 -0.0092 0.27 -0.00102 -0.0091 0.0115 0.096 -0.163
## C10 -0.056 -0.0343 0.27 -0.00370 -0.0332 0.0118 0.106 -0.180
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## C1 19719 0.39 0.457 0.419 0.120 3.3 1.1
## C2 19719 0.30 0.210 0.009 -0.052 3.0 1.4
## C3 19719 0.40 0.478 0.384 0.156 4.0 1.0
## C4 19719 0.24 0.165 -0.058 -0.083 2.7 1.2
## C5 19719 0.29 0.333 0.222 -0.032 2.7 1.2
## C6 19719 0.27 0.166 -0.032 -0.096 2.9 1.4
## C7 19719 0.40 0.451 0.368 0.112 3.6 1.2
## C8 19719 0.13 0.089 -0.228 -0.162 2.5 1.1
## C9 19719 0.44 0.477 0.461 0.132 3.2 1.2
## C10 19719 0.44 0.506 0.456 0.192 3.6 1.0
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## C1 0 0.06 0.17 0.29 0.34 0.14 0
## C2 0 0.19 0.21 0.19 0.24 0.16 0
## C3 0 0.02 0.07 0.17 0.37 0.36 0
## C4 0 0.21 0.29 0.24 0.18 0.09 0
## C5 0 0.21 0.26 0.26 0.18 0.10 0
## C6 0 0.21 0.23 0.17 0.22 0.17 0
## C7 0 0.06 0.11 0.23 0.33 0.27 0
## C8 0 0.24 0.26 0.32 0.13 0.05 0
## C9 0 0.11 0.19 0.25 0.28 0.18 0
## C10 0 0.03 0.09 0.31 0.35 0.22 0
This is done with the purrr package, which is the Tidyverse version of apply. It allows us to iterate over rows,columns,dataframes in a clean and concise manner.
Extraversion %>% mutate(score = pmap_dbl(.,function(...) mean(c(...))))
Creating a cleaning function
replace_over_4<- function(x) replace(x,x>4,NA)
health_tbl<- read_tsv("./data/w4inhome_dvn.tab") %>%
dplyr::select(mh1 = H4MH3, mh2 = H4MH4, mh3 = H4MH5, mh4 = H4MH6,
jailage = H4CJ20, gender = BIO_SEX4) %>%
mutate_at(vars(mh1:mh4),replace_over_4) %>%
mutate(mh1 = 4 - mh1,
mh4 = 4 - mh4) %>%
rowwise() %>%
mutate(mh = mean(c(mh1,mh2,mh3,mh4),na.rm=T)) %>%
ungroup() %>%
mutate(gender = factor(gender,levels =c(1,2),labels = c("male","female"))) %>%
filter(jailage < 97)
Visualization
dplyr::select(health_tbl, 5:7) %>% ggpairs()
dplyr::select(health_tbl, -gender) %>% cor(use ="pairwise")
## mh1 mh2 mh3 mh4 jailage mh
## mh1 1.0000000 0.18972172 0.33470223 0.4286938 0.07206430 0.74075325
## mh2 0.1897217 1.00000000 0.27916435 0.2129501 -0.03611862 0.57331576
## mh3 0.3347022 0.27916435 1.00000000 0.4123384 0.04372349 0.70121523
## mh4 0.4286938 0.21295010 0.41233840 1.0000000 0.02069340 0.75779718
## jailage 0.0720643 -0.03611862 0.04372349 0.0206934 1.00000000 0.04231706
## mh 0.7407533 0.57331576 0.70121523 0.7577972 0.04231706 1.00000000
Describe
dplyr::select(health_tbl,-gender) %>% apa.cor.table(.,filename="./output/Cortable.doc",table.number=1)
##
##
## Table 1
##
## Means, standard deviations, and correlations with confidence intervals
##
##
## Variable M SD 1 2 3 4
## 1. mh1 2.47 1.21
##
## 2. mh2 2.98 0.97 .19**
## [.09, .28]
##
## 3. mh3 2.30 0.94 .33** .28**
## [.24, .42] [.18, .37]
##
## 4. mh4 2.56 1.14 .43** .21** .41**
## [.34, .51] [.12, .31] [.33, .49]
##
## 5. jailage 19.16 3.42 .07 -.04 .04 .02
## [-.03, .17] [-.14, .06] [-.06, .14] [-.08, .12]
##
## 6. mh 2.58 0.74 .74** .57** .70** .76**
## [.69, .78] [.50, .64] [.65, .75] [.71, .80]
##
## 5
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## .04
## [-.06, .14]
##
##
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
##
Analysis
model_1<- lm(mh ~ jailage + gender, data = health_tbl)
par(mfrow = c(2,2))
plot(model_1)
par(mfrow=c(1,1))
tidy(summary(model_1))
model1_augment<-augment(model_1)
ggplot(model1_augment,aes(x=jailage, y=mh,color=gender)) +
geom_line(aes(y=.fitted))
Model 2
model2<- lm(mh ~ jailage + gender + jailage:gender, data=health_tbl)
par(mfrow=c(2,2))
plot(model2)
par(mfrow=c(1,1))
tidy(summary(model2))
ggplot(health_tbl, aes(x=jailage, y=mh, color=gender, group=gender)) +
geom_smooth(method= "lm",se=F)
Comparing Models
anova(model_1,model2)
Comparing Change in \(R^2\)
summary(model2)$r.squared-summary(model_1)$r.squared
## [1] 0.002434573
Data Import and Cleaning
health_tbl<- read_tsv("./data/w4inhome_dvn.tab") %>%
transmute(admin_month = IMONTH4,
gender = factor(BIO_SEX4,levels = c(1,2),labels= c("male","female")),
living_mother = factor(H4WP1,levels= c(0,1,8),labels=c("No","Yes","Don't Know")),
fiw = replace(H4LM29, H4LM29 >=6,NA))
Plotting
ggpairs(health_tbl)
Analysis
options(contrasts = c("contr.sum", "contr.poly"))
linear_model<- lm(fiw ~ admin_month + living_mother + gender, data = health_tbl)
anova_model<- Anova(linear_model,type = 3)
anova_model
kable(anova_model,digits = 3, out="./output/anova_model.htm")
| Sum Sq | Df | F value | Pr(>F) | |
|---|---|---|---|---|
| (Intercept) | 3235.235 | 1 | 4298.538 | 0.000 |
| admin_month | 3.408 | 1 | 4.529 | 0.033 |
| living_mother | 2.441 | 2 | 1.622 | 0.198 |
| gender | 41.747 | 1 | 55.468 | 0.000 |
| Residuals | 3778.233 | 5020 | NA | NA |
Diagnostic Plots
plot(linear_model)
pirateplot(formula = "fiw ~ living_mother + gender",data = health_tbl)
Marginal Means Plot
mm_df<- tidy(lsmeans(linear_model,"gender",by="living_mother"))
ggplot(mm_df,
aes(x=living_mother,
y=estimate,
color=gender,
group=gender)) +
geom_line()
Tukeys Post-hoc Tests
health_tbl %<>% mutate(cond=interaction(gender,living_mother,sep=" x "))
posthoc_model<- lm(fiw~admin_month + cond, data = health_tbl)
posthocs <- glht(posthoc_model, linfct=mcp(cond="Tukey"))
summary(posthocs)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = fiw ~ admin_month + cond, data = health_tbl)
##
## Linear Hypotheses:
## Estimate Std. Error t value
## female x No - male x No == 0 -0.32298 0.10918 -2.958
## male x Yes - male x No == 0 0.02272 0.08075 0.281
## female x Yes - male x No == 0 -0.15289 0.08052 -1.899
## male x Don't Know - male x No == 0 0.09060 0.20497 0.442
## female x Don't Know - male x No == 0 -0.14135 0.19725 -0.717
## male x Yes - female x No == 0 0.34571 0.07805 4.429
## female x Yes - female x No == 0 0.17009 0.07775 2.188
## male x Don't Know - female x No == 0 0.41359 0.20393 2.028
## female x Don't Know - female x No == 0 0.18163 0.19615 0.926
## female x Yes - male x Yes == 0 -0.17562 0.02541 -6.911
## male x Don't Know - male x Yes == 0 0.06788 0.19023 0.357
## female x Don't Know - male x Yes == 0 -0.16407 0.18186 -0.902
## male x Don't Know - female x Yes == 0 0.24349 0.19011 1.281
## female x Don't Know - female x Yes == 0 0.01154 0.18171 0.064
## female x Don't Know - male x Don't Know == 0 -0.23195 0.26186 -0.886
## Pr(>|t|)
## female x No - male x No == 0 0.027 *
## male x Yes - male x No == 0 1.000
## female x Yes - male x No == 0 0.341
## male x Don't Know - male x No == 0 0.997
## female x Don't Know - male x No == 0 0.973
## male x Yes - female x No == 0 <0.001 ***
## female x Yes - female x No == 0 0.196
## male x Don't Know - female x No == 0 0.270
## female x Don't Know - female x No == 0 0.922
## female x Yes - male x Yes == 0 <0.001 ***
## male x Don't Know - male x Yes == 0 0.999
## female x Don't Know - male x Yes == 0 0.929
## male x Don't Know - female x Yes == 0 0.750
## female x Don't Know - female x Yes == 0 1.000
## female x Don't Know - male x Don't Know == 0 0.934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
Effect size estimates
etaSquared(linear_model, type=3,anova=T)
## eta.sq eta.sq.part SS df MS F
## admin_month 0.0008912691 0.0009012830 3.408329 1 3.4083293 4.528522
## living_mother 0.0006383947 0.0006457325 2.441305 2 1.2206523 1.621836
## gender 0.0109167729 0.0109286326 41.747165 1 41.7471646 55.467924
## Residuals 0.9879980344 NA 3778.233464 5020 0.7526361 NA
## p
## admin_month 3.338298e-02
## living_mother 1.976392e-01
## gender 1.112443e-13
## Residuals NA
Publication worthy APA table
apa.aov.table(linear_model,conf.level=95,type=3,"./output/analysis.doc")
##
##
## ANOVA results using fiw as the dependent variable
##
##
## Predictor SS df MS F p partial_eta2
## (Intercept) 3235.24 1 3235.24 4298.54 .000
## admin_month 3.41 1 3.41 4.53 .033 .00
## living_mother 2.44 2 1.22 1.62 .198 .00
## gender 41.75 1 41.75 55.47 .000 .01
## Error 3778.23 5020 0.75
## CI_90_partial_eta2
##
## [.00, .00]
## [.00, .00]
## [.01, .02]
##
##
## Note: Values in square brackets indicate the bounds of the 90% confidence interval for partial eta-squared
Planned Contrasts
linear_model_lsm<- lsmeans(linear_model,"living_mother")
contrast(linear_model_lsm, list(knowledge=c(-.5,-.5,1)))
## contrast estimate SE df t.ratio p.value
## knowledge 0.0879 0.134 5020 0.657 0.5111
##
## Results are averaged over the levels of: gender